Clustering Document Images Using Graph Summaries
Authors
Abstract
Document image classification is an important step in document image analysis. Based on classification results, we can tackle other tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we can group documents into clusters that are valid from the user's point of view. The semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper we describe document images based on frequently occurring symbols. This document description is created in an unsupervised manner and can be related to the domain knowledge. Using data mining techniques applied to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of "symbols" represents the description of a document. We present results obtained on a corpus of graphical document images.
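The bag-of-symbols idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' method: it assumes the frequent subgraphs have already been mined and that each one has a hypothetical identifier such as "s1". The abstract does not say which similarity measure or clustering algorithm was used, so cosine similarity and a single-link threshold grouping are stand-ins chosen here for brevity.

```python
from collections import Counter
from math import sqrt


def bag_of_symbols(symbol_occurrences):
    """Count how often each frequent-subgraph ID occurs in one document."""
    return Counter(symbol_occurrences)


def cosine(a, b):
    """Cosine similarity between two bags (0.0 if either bag is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(bags, threshold=0.5):
    """Assumed stand-in: single-link grouping of documents whose
    similarity is at least `threshold`, using a small union-find."""
    parent = list(range(len(bags)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bags)):
        for j in range(i + 1, len(bags)):
            if cosine(bags[i], bags[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(bags)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical symbol IDs; each inner list is one document image.
documents = [
    ["s1", "s1", "s2"],
    ["s1", "s2", "s2"],
    ["s7", "s8"],
    ["s8", "s8", "s7"],
]
bags = [bag_of_symbols(d) for d in documents]
clusters = cluster(bags)
print(clusters)
```

Documents that share many of the same symbols end up in the same group, while documents with no symbols in common have similarity 0 and stay apart. The threshold of 0.5 is arbitrary and would need tuning on a real corpus.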
Similar resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
NN Networks and Automated Annotation for Browsing Large Image Collections from the World Wide Web
This paper outlines a system for searching and browsing 1.14 million images from the World Wide Web (WWW) based on their visual content. At the heart of the system lies an automatically constructed network of images that can be navigated quickly by following its edges. The browsing experience is enhanced in a number of ways including multidimensional scaling of the graph neighbourhood for displ...
Cross-Document Co-Reference Resolution using Sample-Based Clustering with Knowledge Enrichment
Identifying and linking named entities across information sources is the basis of knowledge acquisition and at the heart of Web search, recommendations, and analytics. An important problem in this context is cross-document coreference resolution (CCR): computing equivalence classes of textual mentions denoting the same entity, within and across documents. Prior methods employ ranking, clusterin...
A News Summarization System using Fuzzy Graph Based Document Model
This paper describes a news summarization system using the Fuzzy Graph based Document Model. News articles are modelled as fuzzy graphs whose nodes are sentences and whose edges are weighted by the fuzzy similarity measure between the sentences. The similarity between sentences lies between 0 and 1. Centrality in the graph is used to retrieve important sentences. The proposed system produces summaries by Eige...
Indexation of Document Images Using Frequent Items
Documents exist in different formats. When we have document images, in order to access some part, preferably all, of the information contained in those images, we have to deploy a document image analysis application. Document images can be mostly textual or mostly graphical. If a user's task is to retrieve, from a set, the document images relevant to a query, we must use indexing techniques. Th...